Word-pair extraction for lexicography
Abstract
We describe an application of sentence alignment techniques and approximate string matching to the problem of extracting lexicographically interesting word-word pairs from multilingual corpora. Since our interest is in support systems for lexicographers rather than in fully automatic construction of lexicons, we would like to provide access to parameters allowing a tunable trade-off between precision and recall. We evaluate two techniques for doing this. Since sentence alignment tends to associate semantically similar words, while approximate string matching draws attention to orthographic similarities, the two can serve different lexicographic purposes, as can their combination, which amounts, inter alia, to a tool for uncovering faux amis. We conclude by sketching a simple and flexible means for allowing lexicographers to provide information which has the potential to improve system performance.
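As a rough illustration of the orthographic side of this idea, the sketch below ranks cross-language word pairs by string similarity and exposes a single threshold as the tunable parameter: lowering it favours recall, raising it favours precision. It is not the method evaluated in the paper. The similarity measure (the `ratio` of Python's standard `difflib.SequenceMatcher`), the function names `similarity` and `candidate_pairs`, and the sample word lists are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means the lowercased strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(source_words, target_words, threshold=0.8):
    """Return (source, target, score) triples scoring at or above the threshold.

    The threshold is the hypothetical tuning knob: a lower value admits more
    pairs (higher recall), a higher value admits fewer (higher precision).
    """
    pairs = []
    for s, t in product(source_words, target_words):
        score = similarity(s, t)
        if score >= threshold:
            pairs.append((s, t, score))
    # Highest-scoring pairs first, so a lexicographer reviews the best ones early.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    english = ["library", "actual"]
    french = ["librairie", "actuel", "maison"]
    for s, t, score in candidate_pairs(english, french, threshold=0.7):
        print(f"{s:10s} {t:10s} {score:.2f}")
```

Pairs that score highly here but that sentence alignment does not associate with each other would be natural faux amis candidates for a lexicographer to inspect; the decision itself stays with the human reviewer.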
Similar resources
Computational Lexicography and Lexicology Elexbi, a Basic Tool for Bilingual Term Extraction from Spanish-Basque Parallel Corpora
We present the work done by Elhuyar Foundation in the field of bilingual terminology extraction. The aim of this work is to develop some techniques for the automatic extraction of pairs of equivalent terms from Spanish-Basque translation memories, and to implement those techniques in a prototype. Our approach is based on a monolingual extraction of term candidates in each language, then the creati...
The Application of Fuzzy Logic to Collocation Extraction
Collocations are important for many tasks of Natural language processing such as information retrieval, machine translation, computational lexicography etc. So far many statistical methods have been used for collocation extraction. Almost all the methods form a classical crisp set of collocation. We propose a fuzzy logic approach of collocation extraction to form a fuzzy set of collocations in ...
Statistical Alignment Models for Translational Equivalence
The ever-increasing amount of parallel data opens a rich resource to multilingual natural language processing, enabling models to work on various translational aspects like detailed human annotations, syntax and semantics. With efficient statistical models, many cross-language applications have seen significant progress in recent years, such as statistical machine translation, speech-to-speec...
Corpus Analysis for Lexical Database Construction: A Case of Russian and Czech Wordnets
The paper deals with corpus-based methods applied to the particular tasks of lexical database construction. Different techniques of the corpus analysis are discussed and their applicability for the tasks is assessed. Corpus management system Manatee + Bonito developed at the Faculty of Informatics, Masaryk University in Brno, Czech Republic, is presented as a tool that enables to perform all di...
ELexBI, A BASIC TOOL FOR BILINGUAL TERM EXTRACTION FROM SPANISH-BASQUE PARALLEL CORPORA
We present the work done by Elhuyar Foundation in the field of bilingual terminology extraction. The aim of this work is to develop some techniques for the automatic extraction of pairs of equivalent terms from Spanish-Basque translation memories, and to implement those techniques in a prototype. Our approach is based on a previous monolingual extraction of term candidates in each language, the...